22 research outputs found

    Improving Word Segmentation for Thai Speech Translation

    A vocabulary list and a language model are primary components of a speech translation system. Generating both from plain text is straightforward for English. However, it is quite challenging for Chinese, Japanese, or Thai, which provide no word segmentation, i.e. the text has no word boundary delimiter. For Thai word segmentation, Maximal Matching, a lexicon-based approach, is one of the popular methods. Nevertheless, this method relies heavily on the coverage of the lexicon. When the text contains an unknown word, this method usually produces a wrong boundary. When words are then extracted from the segmented text, some are not retrieved because of the wrong segmentation. In this paper, we propose statistical techniques to tackle this problem. Based on different word segmentation methods, we develop various speech translation systems and show that the proposed method can significantly improve translation accuracy, by about 6.42% BLEU points compared to the baseline system.
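The lexicon dependence described in this abstract can be illustrated with a minimal sketch of maximal matching, assuming the common formulation that picks the segmentation with the fewest lexicon words. The function name `maximal_matching`, the `unknown_cost` penalty, and the toy English lexicon are illustrative assumptions, not the paper's implementation.

```python
def maximal_matching(text, lexicon, unknown_cost=10):
    """Segment `text` into the fewest lexicon words (dynamic programming).

    Characters not covered by any lexicon word are emitted as
    single-character tokens, each charged `unknown_cost` (an assumed penalty).
    """
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=1)
    cost = [float("inf")] * (n + 1)  # cost[i]: best cost of segmenting text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last token
    cost[0] = 0
    for i in range(n):
        # Fallback: treat text[i] as an unknown single-character token.
        if cost[i] + unknown_cost < cost[i + 1]:
            cost[i + 1] = cost[i] + unknown_cost
            back[i + 1] = i
        # Try every lexicon word that starts at position i.
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in lexicon and cost[i] + 1 < cost[j]:
                cost[j] = cost[i] + 1
                back[j] = i
    # Follow the back-pointers from the end to recover the tokens.
    tokens = []
    j = n
    while j > 0:
        i = back[j]
        tokens.append(text[i:j])
        j = i
    return tokens[::-1]


lexicon = {"ice", "cream", "icecream", "is", "cold"}
print(maximal_matching("icecreamiscold", lexicon))  # all words known
print(maximal_matching("icexyzcold", lexicon))      # "xyz" is not in the lexicon
```

In the second call the out-of-lexicon string is broken into single characters, which is the kind of wrong boundary that then prevents the unknown word from being extracted into a vocabulary list.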

    Automatic Sentence Break Disambiguation for Thai

    Unlike English, the Thai language has no explicit sentence marker. Conventionally, a space is placed at the end of a sentence in written Thai, but a space does not always indicate a sentence boundary. In this paper, we propose a feature-based algorithm that extracts sentences from a paragraph by detecting the appropriate sentence-breaking spaces. The algorithm considers the context around a space to determine whether or not it is a sentence-breaking space. The previous method, a probabilistic POS trigram approach, considers only coarse part-of-speech information in a limited range of context, whereas the feature-based approach considers as many features as possible. A feature can be anything that examines specific information in the context around the target word sequence, such as context words and collocations. To automatically extract such features from a training corpus, we employ a learning algorithm called Winnow. The experimental results show that Winnow is more effective than the POS trigram approach on this task.
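As a rough illustration of the learner named in this abstract, the following is a minimal sketch of the basic Winnow update rule (multiplicative promotion and demotion of weights over binary features). The function names, the choice of threshold equal to the number of features, the factor `alpha=2.0`, and the toy feature indices are all assumptions for illustration; the paper's actual feature set and training configuration are not given here.

```python
def train_winnow(examples, n_features, alpha=2.0, epochs=10):
    """Train a basic Winnow classifier.

    `examples` is a list of (active_features, label) pairs, where
    `active_features` holds the indices of the binary features that fire
    (e.g. hypothetical context-word features around a space) and `label`
    is True for a sentence-breaking space.
    """
    weights = [1.0] * n_features
    theta = float(n_features)  # decision threshold (assumed choice)
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            predicted = sum(weights[i] for i in active) >= theta
            if predicted != label:
                mistakes += 1
                # Promote active weights on a missed positive,
                # demote them on a false positive.
                factor = alpha if label else 1.0 / alpha
                for i in active:
                    weights[i] *= factor
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights, theta


def predict(weights, theta, active):
    """Return True if the summed weights of active features reach theta."""
    return sum(weights[i] for i in active) >= theta


# Toy data: the label is True exactly when feature 0 is active.
data = [({0, 1}, True), ({1, 2}, False), ({0, 3}, True),
        ({2, 3}, False), ({0}, True), ({1}, False)]
weights, theta = train_winnow(data, n_features=4)
print(weights, [predict(weights, theta, a) for a, _ in data])
```

Because updates are multiplicative and touch only the active features, weights of relevant features grow quickly, which is one reason Winnow is often used when many candidate features are generated and only a few matter.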

    Example-Based Grapheme-to-Phoneme Conversion for Thai

    Several characteristics of the Thai writing system make Thai grapheme-to-phoneme (G2P) conversion very challenging. In this paper, we propose an Example-Based Grapheme-to-Phoneme conversion approach. It generates the pronunciation of a word by selecting, modifying, and combining pronunciations of syllables from the training corpus. The best system achieves 80.99% word accuracy and 94.19% phone accuracy, significantly outperforming previous approaches for Thai.

    Thai Grapheme-Based Speech Recognition

    In this paper, we present the results of building a grapheme-based speech recognition system for Thai. We experiment with different settings for the initial context-independent system, different numbers of acoustic models, and different contexts for the speech unit. In addition, we investigate the potential of an enhanced tree clustering method as a way of sharing parameters across models. We compare our system with two phoneme-based systems: one that uses a handcrafted dictionary and another that uses an automatically generated dictionary. Experimental results show that the grapheme-based system with enhanced tree clustering outperforms the phoneme-based system using an automatically generated dictionary, and has comparable results to the phoneme-based system with the handcrafted dictionary.

    A Context-Sensitive Homograph Disambiguation in Thai Text-to-Speech Synthesis

    Homograph ambiguity is an original issue in Text-to-Speech (TTS). To disambiguate homographs, several efficient approaches have been proposed, such as part-of-speech (POS) n-gram, Bayesian classifier, decision tree, and Bayesian-hybrid approaches. These methods need the words and/or POS tags surrounding the homograph in question for disambiguation. Some languages, such as Thai, Chinese, and Japanese, have no word-boundary delimiter. Therefore, before solving homograph ambiguity, we need to identify word boundaries. In this paper, we propose a unique framework that solves the word segmentation and homograph ambiguity problems together. Our model employs both local and long-distance contexts, which are automatically extracted by a machine learning technique called Winnow.

    Feature-based Thai Word Segmentation

    Word segmentation is a problem in several Asian languages that have no explicit word boundary delimiter, e.g. Chinese, Japanese, Korean, and Thai. We propose to use feature-based approaches for Thai word segmentation.